Nearest neighbors makes no mathematical assumptions and requires few dependencies: only a way to compute the distance between points, and the assumption that points close to one another are similar. One downside is that nearest neighbors does not take all of the available information into account; it makes predictions based only on the handful of points closest to the one of interest.
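The classifier below relies on a `distance` helper pulled in from the linear algebra notebook. As a standalone sketch, a minimal Euclidean version might look like:

```python
import math

def distance(v, w):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

print(distance([0, 0], [3, 4]))  # 5.0
```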
In [1]:
%%capture
%run "4 - Linear Algebra.ipynb"
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values() if count == winner_count])
    if num_winners == 1:
        return winner                          # unique winner, so return it
    else:
        return majority_vote(labels[:-1])      # try again without the farthest

def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    def key_func(point):
        p, _ = point
        return distance(p, new_point)

    # order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points, key=key_func)

    # find the labels for the k closest
    k_nearest_labels = [label for _, label in by_distance[:k]]

    # and let them vote
    return majority_vote(k_nearest_labels)
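To see the tie-breaking in action, here is a quick standalone check (repeating the `majority_vote` definition above so the snippet runs on its own): with a 2-2 tie, the farthest label is dropped and the vote is retried.

```python
from collections import Counter

def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([c for c in vote_counts.values() if c == winner_count])
    if num_winners == 1:
        return winner
    return majority_vote(labels[:-1])  # drop the farthest label and retry

# 'a' and 'b' are tied 2-2; dropping the farthest label breaks the tie
print(majority_vote(['a', 'b', 'a', 'b']))  # 'a'
```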
In [3]:
from sklearn import datasets
iris = datasets.load_iris()
iris_0 = iris.data[iris.target == 0]
iris_1 = iris.data[iris.target == 1]
iris_2 = iris.data[iris.target == 2]
plt.scatter(iris_0[:, 0], iris_0[:, 1], label='0');
plt.scatter(iris_1[:, 0], iris_1[:, 1], label='1');
plt.scatter(iris_2[:, 0], iris_2[:, 1], label='2');
plt.legend();
In [4]:
# Classify a point that would lie at 6, 3
knn_classify(1, zip(iris.data[:, :2], iris.target), [6, 3])
Out[4]:
In [5]:
# try several different values for k
for k in [1, 3, 5, 7, 15]:
    num_correct = 0
    num_total = 0
    for point, target in zip(iris.data[:, :2], iris.target):
        num_total += 1
        # leave the current point out of the training data
        other_iris = [(p, t) for (p, t) in zip(iris.data[:, :2], iris.target)
                      if p[0] != point[0] or p[1] != point[1]]
        predicted_iris = knn_classify(k, other_iris, point)
        if predicted_iris == target:
            num_correct += 1
    print(k, "neighbor[s]:", num_correct, "correct out of", num_total)
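For comparison, the same leave-one-out experiment can be run with scikit-learn's `KNeighborsClassifier`. This is only a sketch; the counts may differ slightly from the loop above because scikit-learn breaks distance and voting ties differently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target

for k in [1, 3, 5, 7, 15]:
    # one score (0 or 1) per held-out point
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=LeaveOneOut())
    print(k, "neighbor[s]:", int(scores.sum()), "correct out of", len(scores))
```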
You can use k-nearest neighbors to make predictions on novel data. Here is how our model classifies points across the feature space, colored by the three possible target values:
In [6]:
k = 1
for x in np.arange(4, 8, 0.15):
    for y in np.arange(1, 5, 0.15):
        predicted_iris = knn_classify(k, zip(iris.data[:, : 2], iris.target), [x, y])
        if predicted_iris == 0:
            color = 'blue'
        elif predicted_iris == 1:
            color = 'orange'
        else:
            color = 'green'
        plt.scatter(x, y, color=color)